Duplicate detection in the Reuters collection

Author

Mark Sanderson

Abstract

In a bibliographic database, the main task is not to find exact duplicate records but to find records that refer to the same work yet differ in some way. Such differences are typically due to inaccurate or inconsistent data entry. One detection method was developed by Ridley [Ridley 92], who adopted a two-stage technique. First, every record in the database was assigned a number generated by a hashing function that took fields of the bibliographic record as its input. In the second stage, records that shared a hash number were examined in greater detail. This entailed comparing fields with customised processes: for example, the author field process looked for missing initials, and the title field process looked for a missing suffix. Detection techniques of this kind are supported by the work of O’Neill et al. [O’Neill 93], who manually examined duplicate bibliographic records to find which fields were most likely to differ.
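The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not Ridley's actual method: the choice of hashed fields (surname, start of the title, year), the CRC32 hash, the record layout, and the reading of "missing suffix" as a dropped trailing part of the title are all assumptions made for illustration, as are the function names.

```python
import re
import zlib
from collections import defaultdict
from itertools import combinations


def words(text):
    """Lower-case a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_author(name):
    """Split 'Surname, I. J.' into (surname, initials); format is assumed."""
    surname, _, rest = name.partition(",")
    initials = "".join(w[0] for w in words(rest))
    return surname.strip().lower(), initials


def hash_number(record):
    """Stage 1: a coarse hash number built from record fields (assumed choice)."""
    surname, _ = split_author(record["author"])
    title_words = words(record["title"])
    title_start = title_words[0][:4] if title_words else ""
    key = "|".join([surname, title_start, str(record["year"])])
    return zlib.crc32(key.encode("utf-8"))


def authors_match(a, b):
    """Author process: same surname, tolerating missing trailing initials."""
    surname_a, initials_a = split_author(a)
    surname_b, initials_b = split_author(b)
    return surname_a == surname_b and (
        initials_a.startswith(initials_b) or initials_b.startswith(initials_a)
    )


def titles_match(a, b):
    """Title process: one title equals the other with a trailing part missing."""
    shorter, longer = sorted((words(a), words(b)), key=len)
    return bool(shorter) and longer[: len(shorter)] == shorter


def find_duplicates(records):
    """Return index pairs that share a hash number and pass the field checks."""
    buckets = defaultdict(list)
    for index, record in enumerate(records):
        buckets[hash_number(record)].append(index)

    pairs = []
    for indexes in buckets.values():
        # Stage 2: only records in the same bucket are compared in detail.
        for i, j in combinations(indexes, 2):
            if authors_match(records[i]["author"], records[j]["author"]) and \
               titles_match(records[i]["title"], records[j]["title"]):
                pairs.append((i, j))
    return pairs
```

Because the detailed comparison runs only within a bucket, the hashed fields must be ones that rarely differ between duplicates. A field that is often mistyped would make a poor hash input, since true duplicates would then land in different buckets and never be compared.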


Similar resources

Duplicate detection in the Reuters collection


A New Method for Duplicate Detection Using Hierarchical Clustering of Records

The accuracy and validity of data are prerequisites for the proper operation of any software system. Errors can always occur in data because of human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. Only one of them should be present in a data source, but for some reasons like aggregation of ...


Identification of MIR-Flickr Near-duplicate Images - A Benchmark Collection for Near-duplicate Detection

There are many contexts where the automated detection of near-duplicate images is important, for example the detection of copyright infringement or images of child abuse. There are many published methods for the detection of similar and near-duplicate images; however it is still uncommon for methods to be objectively compared with each other, probably because of a lack of any good framework in ...


Reuters test collection Saturday, 11 June, 1994

This short paper presents the little known Reuters 22,173 test collection, which is significantly larger than most traditional test collections. In addition, Reuters has none of the recall calculation problems normally associated with some of the larger test collections now available. This paper explains the method (derived from Lewis [Lewis 91]) used to perform retrieval experiments on the Reu...


Categorizing Gigabytes: Experiments on the RCV1 Corpus

This paper presents categorization results obtained with the HITEC categorizer tool on the new benchmark document collection for text categorization, the Reuters Corpus Volume 1 (RCV1). RCV1 is an archive of over 800,000 manually categorized newswire stories made available by Reuters in 2000 for research purposes. This collection was released to take the place of the Reuters-21578 collection tha...


Publication date: 1997
